Quantifying the Performance Benefits of Partitioned Communication in MPI
Partitioned communication was introduced in MPI 4.0 as a user-friendly
interface to support pipelined communication patterns, particularly common in
the context of MPI+threads. It provides the user with the ability to divide a
global buffer into smaller independent chunks, called partitions, which can
then be communicated independently. In this work we first model the performance
gain that can be expected when using partitioned communication. Next, we
describe the improvements we made to MPICH to enable those gains and provide
a high-quality implementation of MPI partitioned communication. We then
evaluate partitioned communication in various common use cases and assess the
performance in comparison with other MPI point-to-point and one-sided
approaches. Specifically, we first investigate two scenarios commonly
encountered for small partition sizes in a multithreaded environment: thread
contention and overhead of using many partitions. We propose two solutions to
alleviate the measured penalty and demonstrate their use. We then focus on
large messages and the gain obtained when exploiting the delay resulting from
computations or load imbalance. We conclude with our perspectives on the
benefits of partitioned communication and on the various results obtained.
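As a rough illustration of the kind of gain partitioned communication targets, consider a simple pipelining sketch (our own illustrative model, not the one derived in the paper): splitting a message into n partitions lets the transfer of partition i overlap the computation of partition i+1.

```python
# Illustrative pipeline model (an assumption for exposition, not the
# paper's model): each of n partitions takes c seconds to compute and
# t seconds to transmit.

def serial_time(n, c, t):
    # No overlap: compute all partitions, then send all of them.
    return n * c + n * t

def pipelined_time(n, c, t):
    # Perfect overlap: after the first partition is computed, each
    # subsequent compute overlaps the previous partition's transfer.
    return c + (n - 1) * max(c, t) + t

def speedup(n, c, t):
    return serial_time(n, c, t) / pipelined_time(n, c, t)

# With balanced compute and communication (c == t), the speedup
# approaches 2x as the number of partitions grows.
print(round(speedup(16, 1.0, 1.0), 3))  # → 1.882
```

Under this toy model the gain is capped at 2x; the point of the measurements in the paper is to see how close a real implementation can get to the idealized overlap, and at what partition-count overhead.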
C-Coll: Introducing Error-bounded Lossy Compression into MPI Collectives
With the ever-increasing computing power of supercomputers and the growing
scale of scientific applications, the efficiency of MPI collective
communications has become a critical bottleneck in large-scale distributed
and parallel processing. Large message sizes in MPI collectives are a
particular concern because they can significantly degrade overall
parallel performance. To address this issue, prior research simply applies
off-the-shelf fixed-rate lossy compressors in MPI collectives, leading to
suboptimal performance, limited generalizability, and unbounded errors. In this
paper, we propose a novel solution, called C-Coll, which leverages
error-bounded lossy compression to significantly reduce the message size,
resulting in a substantial reduction in communication cost. The key
contributions are three-fold. (1) We develop two general, optimized
lossy-compression-based frameworks for both types of MPI collectives
(collective data movement as well as collective computation), based on their
particular characteristics. Our framework not only reduces communication cost
but also preserves data accuracy. (2) We customize an optimized version based
on SZx, an ultra-fast error-bounded lossy compressor, which can meet the
specific needs of collective communication. (3) We integrate C-Coll into
multiple collectives, such as MPI_Allreduce, MPI_Scatter, and MPI_Bcast, and
perform a comprehensive evaluation based on real-world scientific datasets.
Experiments show that our solution outperforms the original MPI collectives as
well as multiple baselines and related efforts by 3.5-9.7X.
Comment: 12 pages, 15 figures, 5 tables, submitted to SC '2
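To make the "error-bounded" distinction concrete, here is a toy uniform quantizer (purely illustrative; SZx's actual pipeline is far more elaborate): every reconstructed value is guaranteed to lie within a user-chosen bound eps of the original, whereas a fixed-rate compressor fixes the output size and lets the error float.

```python
# Toy error-bounded quantizer (an illustrative sketch, not SZx):
# snapping each value to the nearest multiple of 2*eps guarantees a
# pointwise reconstruction error of at most eps.

def compress(values, eps):
    q = 2.0 * eps
    return [round(v / q) for v in values]  # small integer codes

def decompress(codes, eps):
    q = 2.0 * eps
    return [c * q for c in codes]

data = [0.013, -1.207, 3.14159, 2.5]
eps = 0.01
restored = decompress(compress(data, eps), eps)
assert all(abs(a - b) <= eps for a, b in zip(data, restored))
```

The same bound-preservation concern appears in collective computation (e.g., MPI_Allreduce), where compression error can accumulate across reduction steps, which is one reason a collective-aware design matters rather than dropping a compressor in blindly.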